ABSTRACT



A dynamic memory allocation routine maintains an alloca- 
tion size cache which records the address of a most recently 
allocated memory block for each different size of memory 
block that has been allocated. Upon receiving a dynamic 
memory allocation request, the dynamic memory allocation 
routine determines if the requested size is equal to one of the 
sizes recorded in the allocation size cache. If a matching size 
is found, the dynamic memory allocation routine attempts to 
allocate a memory block contiguous to the most recently 
allocated memory block of that matching size. If the con- 
tiguous memory block has been allocated to another 
memory block, the dynamic memory allocation routine 
attempts to reserve a reserved memory block having a size 
which is a predetermined multiple of the requested size. The 
requested memory block is then allocated at the beginning of 
the reserved memory block. By reserving the reserved 
memory block, the dynamic memory allocation routine may 
increase the likelihood that subsequent requests for memory 
blocks having the requested size can be allocated in con- 
tiguous memory locations. 

18 Claims, 9 Drawing Sheets 
DYNAMIC MEMORY ALLOCATION
SUITABLE FOR STRIDE-BASED 
PREFETCHING 

BACKGROUND OF THE INVENTION

1. Field of the Invention 

This invention is related to dynamic memory allocation 
for computer systems. 

2. Description of the Related Art 

Modern microprocessors are demanding increasing 
memory bandwidth to support the increased performance 
achievable by the microprocessors. Increasing clock fre- 
quencies (i.e. shortening clock cycles) employed by the 
microprocessors allow for more data and instructions to be
processed per second, thereby increasing bandwidth require- 
ments. Furthermore, modern microprocessor microarchitec- 
tures are improving the efficiency at which the micropro- 
cessor can process data and instructions. Bandwidth 
requirements are increased even further due to the improved
processing efficiency. 

Computer systems typically have a relatively large, rela- 
tively slow main memory. Typically, multiple dynamic ran- 
dom access memory (DRAM) modules comprise the main 
memory system. The large main memory provides storage
for a large number of instructions and/or a large amount of
data for use by the microprocessor, providing faster access
to the instructions and/or data than may be achieved from a
disk storage, for example. However, the access times of
modern DRAMs are significantly longer than the clock cycle
length of modern microprocessors. The memory access time 
for each set of bytes being transferred to the microprocessor 
is therefore long. Accordingly, the main memory system is 
not a low latency system. Microprocessor performance may 
suffer due to the latency of the memory system.

In order to increase performance, microprocessors may 
employ prefetching to "guess" which data will be requested 
in the future by the program being executed. If the guess is 
correct, the delay of fetching the data from memory has 
already occurred when the data is requested (i.e. the
requested data may be available within the microprocessor). 
In other words, the effective latency of the data is reduced. 
The microprocessor may employ a cache, for example, and 
the data may be prefetched from memory into the cache. The 
term prefetch, as used herein, refers to transferring data into
a microprocessor (or cache memory attached to the 
microprocessor) prior to a request for the data being gener- 
ated via execution of an instruction within the microproces- 
sor. Generally, prefetch algorithms are based upon the 
pattern of accesses which have been performed in response
to the program being executed. A popular data prefetch 
algorithm is the stride-based prefetch algorithm in which the 
difference between the addresses of consecutive accesses 
(the "stride") is added to subsequent access addresses to 
generate a prefetch address.
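
By way of illustration only, the stride computation described above may be sketched as follows. The class name `StridePredictor` and its method are hypothetical and track a single stream of accesses; actual prefetch hardware may differ considerably.

```python
from typing import Optional


class StridePredictor:
    """Toy single-stream stride predictor (illustrative sketch only)."""

    def __init__(self) -> None:
        self.last_address: Optional[int] = None
        self.stride: Optional[int] = None

    def access(self, address: int) -> Optional[int]:
        """Record an access and return a prefetch address, if one can be formed."""
        if self.last_address is not None:
            # The stride is the difference between consecutive access addresses.
            self.stride = address - self.last_address
        self.last_address = address
        if self.stride is None:
            return None  # only one access seen so far; no stride learned yet
        # Prefetch address = current access address + learned stride.
        return address + self.stride


predictor = StridePredictor()
predictions = [predictor.access(a) for a in (0x1000, 0x1040, 0x1080)]
```

In this sketch no prediction is available for the first access, since a stride requires two addresses.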

Stride-based prefetch algorithms often work well with
statically allocated data structures. Data structures are stati- 
cally allocated if they are allocated memory at the initiation 
of a program and remain allocated in that same memory 
throughout execution of the program. Because the data
structure is statically allocated, it is generally laid out in 
contiguous memory locations. Stride-based prefetch algo- 
rithms work well because the memory storing the data 
structure is contiguous and the reference patterns are regular. 
A statically allocated array, for example, may be traversed
by reading memory locations which are separated from each 
other by a regular interval. After just a few memory fetches, 
the stride-based prefetch algorithm may have learned the
regular interval and may correctly predict subsequent 
memory fetches. 

On the other hand, data structures are dynamically allo- 
cated if the memory for the data structures is allocated and 
deallocated as needed during the execution of the program. 
Dynamically allocated data structures have a variety of 
advantages in programs in which the amount of memory 
needed for the data structure varies widely and is difficult or 
impossible to predict ahead of time. Instead of statically 
allocating a very large amount of memory, the memory is 
allocated as needed. Memory space is thereby conserved. 

Unfortunately, dynamic memory allocation algorithms 
are typically not conducive to prefetch algorithms. Dynamic 
memory allocation algorithms typically employ a "first fit" 
approach in which the first available memory block which 
includes at least the number of bytes requested for allocation 
is selected, or a "best fit" approach in which the available 
memory is scanned for a memory block which is closest in 
size to the requested number of bytes or causes the least 
amount of fragmentation if allocated to the request. These 
approaches select memory locations which may have no 
logical relation to other memory locations allocated to the 
data structure. Therefore, traversing the data structure gen- 
erally does not involve regular intervals between the ele- 
ments. A stride-based prefetch algorithm would have a low 
likelihood of prefetching the correct memory locations for 
such a dynamically allocated data structure. Other prefetch 
algorithms have similar difficulties, as the pattern of 
accesses is ill-defined. As used herein, a "memory block"
comprises one or more contiguous bytes of memory allo- 
cated in response to a dynamic memory allocation request. 
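
For comparison, the "first fit" and "best fit" approaches mentioned above may be sketched as follows. The free-block representation (a list of address and size pairs) and the function names are assumptions made for illustration, and `best_fit` implements only the "closest in size" variant.

```python
from typing import List, Optional, Tuple

FreeBlock = Tuple[int, int]  # (start address, size in bytes); hypothetical layout


def first_fit(free_blocks: List[FreeBlock], size: int) -> Optional[int]:
    """Return the address of the first free block holding at least `size` bytes."""
    for address, block_size in free_blocks:
        if block_size >= size:
            return address
    return None


def best_fit(free_blocks: List[FreeBlock], size: int) -> Optional[int]:
    """Return the address of the free block closest in size to the request."""
    candidates = [(block_size, address)
                  for address, block_size in free_blocks
                  if block_size >= size]
    if not candidates:
        return None
    # The smallest block that still fits is the closest in size.
    return min(candidates)[1]


free_blocks = [(0x100, 64), (0x400, 16), (0x800, 24)]
first_choice = first_fit(free_blocks, 16)  # chooses the block at 0x100
best_choice = best_fit(free_blocks, 16)    # chooses the block at 0x400
```

Neither function considers where earlier blocks of the same size were placed, which is why the selected addresses bear no regular relation to one another.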

SUMMARY OF THE INVENTION 

The problems outlined above are in large part solved by 
a dynamic memory allocation routine in accordance with the 
present invention. The dynamic memory allocation routine 
maintains an allocation size cache which records the address 
of a most recently allocated memory block for each different 
size of memory block that has been allocated. Upon receiv- 
ing a dynamic memory allocation request, the dynamic 
memory allocation routine determines if the requested size 
is equal to one of the sizes recorded in the allocation size 
cache. If a matching size is found, the dynamic memory 
allocation routine attempts to allocate a memory block 
contiguous to the most recently allocated memory block of 
that matching size. If the contiguous memory block has been 
allocated to another memory block, the dynamic memory 
allocation routine attempts to reserve a reserved memory 
block having a size which is a predetermined multiple of the 
requested size. The requested memory block is then allo- 
cated at the beginning of the reserved memory block. By 
reserving the reserved memory block, the dynamic memory 
allocation routine may increase the likelihood that subse- 
quent requests for memory blocks having the requested size 
can be allocated in contiguous memory locations. Upon 
allocating a memory block in response to a dynamic 
memory allocation request, the dynamic memory allocation 
routine updates the allocation size cache to reflect the
allocation. 
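
A minimal sketch of the allocation policy summarized above is given below, under several stated assumptions: the heap is modeled as a range of integer addresses, the predetermined multiple is taken to be four, unmatched sizes fall back to a first-fit search, and deallocation is omitted. The names `SizeCachedHeap`, `RESERVE_MULTIPLE`, and `malloc` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

RESERVE_MULTIPLE = 4  # assumed; the text only says "a predetermined multiple"


class SizeCachedHeap:
    """Toy heap over the integer address range [0, heap_size)."""

    def __init__(self, heap_size: int) -> None:
        self.free: List[Tuple[int, int]] = [(0, heap_size)]  # (address, size)
        self.size_cache: Dict[int, int] = {}  # size -> most recent block address
        self.reserved: Dict[int, Tuple[int, int]] = {}  # size -> (next address, bytes left)

    def _take(self, address: int, size: int) -> bool:
        """Claim [address, address + size) if it lies inside one free block."""
        for i, (start, length) in enumerate(self.free):
            end = start + length
            if start <= address and address + size <= end:
                remainder = []
                if address > start:
                    remainder.append((start, address - start))
                if address + size < end:
                    remainder.append((address + size, end - (address + size)))
                self.free[i:i + 1] = remainder
                return True
        return False

    def _first_fit(self, size: int) -> Optional[int]:
        """Claim the first free block of at least `size` bytes (assumed fallback)."""
        for start, length in self.free:
            if length >= size:
                self._take(start, size)
                return start
        return None

    def malloc(self, size: int) -> Optional[int]:
        address = None
        if size in self.size_cache:
            wanted = self.size_cache[size] + size  # contiguous to the last block
            next_reserved, left = self.reserved.get(size, (None, 0))
            if next_reserved == wanted and left >= size:
                # The contiguous space was reserved earlier for this size.
                self.reserved[size] = (wanted + size, left - size)
                address = wanted
            elif self._take(wanted, size):
                address = wanted
            else:
                # Contiguous space is taken: reserve a multiple of the size
                # and place the block at the start of the reservation.
                address = self._first_fit(RESERVE_MULTIPLE * size)
                if address is not None:
                    self.reserved[size] = (address + size,
                                           (RESERVE_MULTIPLE - 1) * size)
        if address is None:
            address = self._first_fit(size)
        if address is not None:
            self.size_cache[size] = address  # record the most recent block
        return address


heap = SizeCachedHeap(1024)
addresses = [heap.malloc(n) for n in (16, 16, 40, 16, 16)]
```

In this example the third 16-byte request cannot be placed contiguously because a 40-byte block intervenes, so a 64-byte region is reserved and the next 16-byte request lands exactly 16 bytes after it.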

Advantageously, elements of a dynamic memory structure 
(e.g. a dynamic data structure) may be allocated memory 
which is contiguous to other elements of the data structure.
A stride-based data prefetch mechanism may thereby more 
accurately predict addresses to be fetched when the dynamic 
data structure is repeatedly accessed (e.g. to traverse the 
dynamic data structure). Performance of computer programs
which use dynamic data structures may be improved when
executing upon a computer system employing the dynamic
memory allocation routine described herein.

The dynamic memory allocation routine described herein
takes advantage of the characteristics exhibited by many
programs employing dynamic data structures. Often, these
programs may employ several dynamic data structures. Each
data structure generally includes data elements having a
fixed size, but the size of the data elements in different data
structures may often differ. Therefore, memory allocation
requests for data elements of a particular size may typically
be requests corresponding to data elements within the same
data structure. Allocating contiguous memory to data
elements having a particular size may thereby lead to regular
access patterns when accessing these elements within the
corresponding dynamic data structure. In this manner,
stride-based prefetching may become more useful in
accessing dynamic data structures.

Broadly speaking, the present invention contemplates a
method for dynamic memory allocation in a computer
system. A first request for dynamic allocation of a first
memory block including a first number of bytes is received.
The first memory block is allocated at a first address
succeeding a second address corresponding to a last byte of
a previously allocated memory block having the first number
of bytes. Alternatively, the first memory block is allocated at
a third address if the previously allocated memory block has
a second number of bytes not equal to the first number of
bytes.

The present invention further contemplates a computer
storage medium configured to store a dynamic memory
management routine which, in response to a first request for
a dynamic allocation of a first memory block having a first
number of bytes: (i) allocates the first memory block at a first
address contiguous to a second memory block having the
first number of bytes; or (ii) allocates the first memory block
at a second address discontiguous to the second memory
block if the second memory block has a second number of
bytes not equal to the first number of bytes.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will
become apparent upon reading the following detailed
description and upon reference to the accompanying
drawings in which:

FIG. 1 is a block diagram of one embodiment of a
microprocessor including a stride-based prefetch unit.

FIG. 2 is a flowchart illustrating operation of one
embodiment of the stride-based prefetch unit shown in FIG. 1.

FIG. 3 is a diagram illustrating division of a memory
space according to one embodiment of an operating system
executed in a computer system including the microprocessor
shown in FIG. 1.

FIG. 4 is a block diagram illustrating one embodiment of
a heap management routine and data structures maintained
thereby.

FIG. 5 is a flowchart illustrating dynamic memory
allocation according to one embodiment of the heap
management routine.

FIG. 6 is a flowchart illustrating dynamic memory
deallocation according to one embodiment of the heap
management routine.

FIG. 7 is a first example of a dynamically allocated data
structure according to one embodiment of the heap
management routine.

FIG. 8 is a second example of a dynamically allocated
data structure according to one embodiment of the heap
management routine.

FIG. 9 is a block diagram of one embodiment of a
computer system including the microprocessor shown in
FIG. 1.

While the invention is susceptible to various
modifications and alternative forms, specific embodiments thereof
are shown by way of example in the drawings and will
herein be described in detail. It should be understood,
however, that the drawings and detailed description thereto
are not intended to limit the invention to the particular form
disclosed, but on the contrary, the intention is to cover all
modifications, equivalents and alternatives falling within the
spirit and scope of the present invention as defined by the
appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Turning now to FIG. 1, a block diagram of one
embodiment of a microprocessor 10 is shown. Microprocessor 10
includes an instruction cache 12, a data cache 14, a decode
unit 16, a plurality of reservation stations including
reservation stations 17A, 17B, and 17C, a plurality of execute
units including execute units 18A and 18B, a load/store unit
20, a reorder buffer 22, a register file 24, a stride prefetch
unit 26, and a microcode unit 28. Elements referred to herein
with a particular reference number followed by a letter will
be collectively referred to by the reference number alone.
For example, the plurality of execute units will be
collectively referred to herein as execute units 18. Execute units 18
may include more execute units than execute units 18A and
18B shown in FIG. 1. Additionally, an embodiment of
microprocessor 10 may include one execute unit 18.

As shown in FIG. 1, instruction cache 12 is coupled to a
main memory subsystem (not shown) and to a decode unit
16, which is further coupled to reservation stations 17,
reorder buffer 22, register file 24, and microcode unit 28.
Reorder buffer 22, execute units 18, and data cache 14 are
each coupled to a result bus 30 for forwarding of execution
results. Furthermore, each reservation station 17A and 17B
is coupled to a respective execute unit 18A and 18B, while
reservation station 17C is coupled to load/store unit 20. Each
reservation station 17 is coupled to receive operand
information from reorder buffer 22. Load/store unit 20 is coupled
to data cache 14, which is further coupled to the main
memory subsystem. Additionally, stride prefetch unit 26 is
coupled to load/store unit 20 and data cache 14.

Generally speaking, microprocessor 10 includes stride
prefetch unit 26 for performing stride-based prefetching.
Other embodiments of microprocessor 10 may include
prefetch units employing a variety of other prefetch
algorithms. Stride prefetch unit 26 monitors the cache accesses
performed by load/store unit 20 in order to learn the stride
(or strides) between accesses. Additionally, stride prefetch
unit 26 monitors the accesses in order to generate prefetch
accesses. According to one embodiment, stride prefetch unit
26 prefetches an address which is the sum of an access
address provided by load/store unit 20 to data cache 14 and
a stride learned by stride prefetch unit 26 from previous
access addresses. The prefetch address is provided to data
cache 14 to determine if a hit occurs. If a miss occurs in data
cache 14, the prefetch address is forwarded to the main
memory subsystem for retrieving the corresponding cache
line from main memory.

Stride prefetch unit 26 may be configured to detect
multiple strides corresponding to different streams of data
accesses. Stride prefetch unit 26 may differentiate the
different streams by limiting the maximum stride between
addresses. If the stride between two consecutive addresses
exceeds the maximum stride, then the two addresses are
assumed to be from different streams of data accesses.
Alternatively, the type of load/store memory operation
performed to generate the addresses may differentiate streams.
For example, the size of the data being accessed (e.g. byte,
word, doubleword, etc.) may be the same for accesses within
a stream but different between accesses which belong to
different streams. Similarly, sign extension/zero extension
and other similar properties of the memory operations may
be used to differentiate streams.

According to one embodiment, the dynamic memory
allocation routine employed within a computer system
including microprocessor 10 uses a memory allocation
algorithm intended to improve the effectiveness of stride
prefetch unit 26. The dynamic memory allocation routine
maintains an allocation size cache which records the address
of the most recently allocated memory block for each size of
memory block that has been allocated. If a request for
allocation of a memory block is received and the requested
size equals one of the sizes recorded in the allocation size
cache, then the dynamic memory allocation routine attempts
to allocate memory contiguous to the previously allocated
memory block of that size. Advantageously, memory blocks
of the same size may often be allocated in contiguous
memory locations. If the memory blocks are part of the same
data structure, a traversal of the data structure may be
correctly prefetched using stride-based prefetching. For
example, a linked list of elements to which elements are
usually added at one end of the list may receive contiguous
memory allocations for the added elements. As the list is
traversed (a relatively common operation in linked lists),
each element will often be at a fixed stride away from the
previous element. Therefore, prefetching based on the stride
may cause each element in the list to be prefetched. Using
a dynamic memory allocation algorithm as described herein
may thereby improve prefetch effectiveness for dynamic
data structures. As used herein, the term routine refers to a
series of instructions arranged to perform a particular
function when executed upon microprocessor 10 or another
microprocessor which is configured to execute the
instruction set defining the instructions.

Instruction cache 12 is a high speed cache memory for
storing instructions. It is noted that instruction cache 12 may
be configured into a set-associative or direct-mapped
configuration. Instruction cache 12 may additionally include a
branch prediction mechanism for predicting branch
instructions as either taken or not taken. Instructions are fetched
from instruction cache 12 and conveyed to decode unit 16
for decode and dispatch to a reservation station 17.

Decode unit 16 decodes each instruction fetched from
instruction cache 12. Decode unit 16 dispatches the
instruction to one or more of reservation stations 17 depending
upon the type of instruction detected. More particularly,
decode unit 16 produces a decoded instruction in response to
each instruction fetched from instruction cache 12. The
decoded instruction comprises control signals to be used by
execute units 18 and/or load/store unit 20 to execute the
instruction. For example, if a given instruction includes a
memory operand, decode unit 16 may signal load/store unit
20 to perform a load/store (i.e. read/write) memory
operation in response to the given instruction.

Decode unit 16 also detects the register operands used by
the instruction and requests these operands from reorder
buffer 22 and register file 24. In one embodiment, execute
units 18 are symmetrical execution units. Symmetrical
execution units are each configured to execute a particular
subset of the instruction set employed by microprocessor 10.
The subsets of the instruction set executed by each of the
symmetrical execution units are the same. In another
embodiment, execute units 18 are asymmetrical execution
units configured to execute dissimilar instruction subsets.
For example, execute units 18 may include a branch execute
unit for executing branch instructions, one or more
arithmetic/logic units for executing arithmetic and logical
instructions, and one or more floating point units for
executing floating point instructions. Decode unit 16 dispatches an
instruction to a reservation station 17 which is coupled to an
execute unit 18 or load/store unit 20 which is configured to
execute that instruction.

Microcode unit 28 is included for handling instructions
for which the architecturally defined operation is more
complex than the hardware employed within execute units
18 and load/store unit 20 may handle. Microcode unit 28
parses the complex instruction into multiple instructions
which execute units 18 and load/store unit 20 are capable of
executing. Additionally, microcode unit 28 may perform
functions employed by microprocessor 10. For example,
microcode unit 28 may perform instructions which represent
a context switch. Generally speaking, the "context" of a
program comprises the state needed to correctly run that
program. Register values created by the program are
included in the context, as are the values stored in any
memory locations used by the program. Microcode unit 28
causes the context stored within microprocessor 10 to be
saved to memory at a predefined memory location
(according to the microprocessor architecture employed by
microprocessor 10) and restores the context of the program
being initiated. Context switches may occur in response to
an interrupt being signalled to microprocessor 10, for
example.

Load/store unit 20 provides an interface between execute
units 18 and data cache 14. Load and store memory
operations are performed by load/store unit 20 to data cache 14.
Additionally, memory dependencies between load and store
memory operations are detected and handled by load/store
unit 20. Generally speaking, a "memory operation" is
performed to transfer data between the main memory and
microprocessor 10. A load memory operation specifies the
transfer of data from one or more memory locations within
the main memory to microprocessor 10. On the other hand,
a store memory operation specifies the transfer of data from
microprocessor 10 to one or more memory locations within
the main memory. The memory location or locations
accessed by a given memory operation are identified within
the main memory by an address corresponding to the given
memory operation.

Reservation stations 17 are configured to store
instructions whose operands have not yet been provided. An
instruction is selected from those stored in a reservation
station 17A-17C for execution if: (1) the operands of the
instruction have been provided, and (2) the instructions
within the reservation station 17A-17C which are prior to
the instruction being selected in program order have not yet
received operands. It is noted that a centralized reservation
station may be included instead of separate reservation
stations. The centralized reservation station is coupled
between decode unit 16, execute units 18, and load/store unit
20. Such an embodiment may perform the dispatch function
within the centralized reservation station.

Microprocessor 10 supports out of order execution, and
employs reorder buffer 22 for storing execution results of
speculatively executed instructions and storing these results
into register file 24 in program order, for performing
dependency checking and register renaming, and for providing for
mispredicted branch and exception recovery. When an
instruction is decoded by decode unit 16, requests for
register operands are conveyed to reorder buffer 22 and
register file 24. In response to the register operand requests,
one of three values is transferred to the reservation station
17A-17C which receives the instruction: (1) the value stored
in reorder buffer 22, if the value has been speculatively
generated; (2) a tag identifying a location within reorder
buffer 22 which will store the result, if the value has not been
speculatively generated; or (3) the value stored in the
register within register file 24, if no instructions within
reorder buffer 22 modify the register. Additionally, a storage
location within reorder buffer 22 is allocated for storing the
results of the instruction being decoded by decode unit 16.
The storage location is identified by a tag, which is conveyed
to the unit receiving the instruction. It is noted that, if more
than one reorder buffer storage location is allocated for
storing results corresponding to a particular register, the
value or tag corresponding to the last result in program order
is conveyed in response to a register operand request for that
particular register.

When execute units 18 or load/store unit 20 execute an
instruction, the tag assigned to the instruction by reorder
buffer 22 is conveyed upon result bus 30 along with the
result of the instruction. Reorder buffer 22 stores the result
in the indicated storage location. Additionally, reservation
stations 17 compare the tags conveyed upon result bus 30
with tags of operands for instructions stored therein. If a
match occurs, the unit captures the result from result bus 30
and stores it with the corresponding instruction. In this
manner, an instruction may receive the operands it is
intended to operate upon. Capturing results from result bus
30 for use by instructions is referred to as "result
forwarding".

Instruction results are stored into register file 24 by
reorder buffer 22 in program order. Storing the results of an
instruction and deleting the instruction from reorder buffer
22 is referred to as "retiring" the instruction. By retiring the
instructions in program order, recovery from incorrect
speculative execution may be performed. For example, if an
instruction is subsequent to a branch instruction whose
taken/not taken prediction is incorrect, then the instruction
may be executed incorrectly. When a mispredicted branch
instruction or an instruction which causes an exception is
detected, reorder buffer 22 discards the instructions
subsequent to the mispredicted branch instructions. Instructions
thus discarded are also flushed from reservation stations 17,
execute units 18, load/store unit 20, and decode unit 16.

Register file 24 includes storage locations for each
register defined by the microprocessor architecture employed
by microprocessor 10. For example, microprocessor 10 may
employ the x86 microprocessor architecture. For such an
embodiment, register file 24 includes locations for storing
the EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP
register values.

Data cache 14 is a high speed cache memory configured
to store data to be operated upon by microprocessor 10. It is
noted that data cache 14 may be configured into a
set-associative or direct-mapped configuration. Data cache 14
allocates and deallocates storage for data in cache lines. A
cache line is a block of contiguous bytes. The byte within the
cache line which has the lowest numerical address is stored
at an address which is aligned to a cache line boundary.

The main memory subsystem effects communication
between microprocessor 10 and devices coupled thereto. For
example, instruction fetches which miss instruction cache 12
may be transferred from a main memory by the main
memory subsystem. Similarly, data requests performed by
load/store unit 20 which miss data cache 14 may be
transferred from main memory by the main memory subsystem.
Additionally, data cache 14 may discard a cache line of data
which has been modified by microprocessor 10. The main
memory subsystem transfers the modified line to the main
memory.

It is noted that decode unit 16 may be configured to
dispatch an instruction to more than one execution unit. For
example, in embodiments of microprocessor 10 which
employ the x86 microprocessor architecture, certain
instructions may operate upon memory operands. Executing such
an instruction involves transferring the memory operand
from data cache 14, executing the instruction, and
transferring the result to memory (if the destination operand is a
memory location) or data cache 14. Load/store unit 20
performs the memory operations, and an execute unit 18
performs the execution of the instruction.

Turning now to FIG. 2, a flowchart illustrating operation
of one embodiment of stride prefetch unit 26 is shown.
Stride prefetch unit 26 detects an access by load/store unit
20 to data cache 14. As illustrated by decision block 42,
stride prefetch unit 26 determines if it has a recorded stride
corresponding to the access. As described above, various
criteria may be used to determine if an access is within a
stream of accesses corresponding to a particular stride. If no
recorded stride corresponds to the access, stride prefetch unit
26 allocates a stride for the access and attempts to learn the
stride (step 44). For example, stride prefetch unit 26 may
record the access address and await another access which is
determined to be within the same stream of data accesses.
The stride may then be calculated from the addresses of the
two accesses. Stride prefetch unit 26 is configured to track
at least one stride, and may optionally be configured to track
a predefined number of additional strides.

If the access detected from load/store unit 20 does
correspond to a recorded stride, stride prefetch unit 26 generates
a prefetch address by adding the access address and the
corresponding stride (step 46). Stride prefetch unit 26
conveys the prefetch address to data cache 14 to determine if the
prefetch address hits in the cache. If the prefetch address
misses, data cache 14 conveys the prefetch address to the
main memory subsystem.

It is noted that, in addition to learning strides and forming
prefetch addresses from the strides and subsequent access
addresses, stride prefetch unit 26 may be configured to
monitor cache accesses performed by load/store unit 20 to
determine the correctness of the prefetch addresses. If a
prefetch address is incorrect, stride prefetch unit 26 may
delete the stride which generated the prefetch address and
attempt to learn a new stride. Alternatively, stride prefetch
unit 26 may continuously update its recorded strides
according to the pattern of accesses observed from load/store unit
20.

Turning next to FIG. 3, a diagram illustrating a memory
address space 50 is shown. Address space 50 is divided
according to an operating system which executes upon
microprocessor 10 within a computer system. The operating
system divides the address space into portions for use by
application programs (e.g. program 1 space 52), the
operating system (e.g. operating system space 54), and a space
referred to as the "heap" 56. Heap 56 is the portion of the
memory which is reserved by the operating system for
dynamic memory allocation. When the dynamic memory
allocation routine is invoked in response to a memory
allocation request, a memory block within heap 56 is allo- 
cated for use by the program performing the request. The 
dynamic memory allocation routine maintains a free list 
indicating which memory locations within heap 56 are 
currently unallocated, and selects a memory block from heap 
56 for allocation. The algorithm employed for selecting the 
memory block is described above and in more detail below. 

Turning now to FIG. 4, a block diagram of one embodi- 
ment of a dynamic memory management routine (e.g. heap 
management routine 60) and data structures maintained by 
heap management routine 60 is shown. Heap management 
routine 60 maintains an allocation size cache 62, a reserve 
list 64, and a free list 66. Allocation size cache 62 includes 
an address field 68 and a size field 70 for each entry. Reserve 
list 64 includes an address field 72, a reserve size field 74, 
and a size field 76 for each entry. Finally, free list 66 includes 
an address field 78 and a size field 80 for each entry. 
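By way of illustration only, the three data structures of FIG. 4 may be pictured as simple records. In the following C sketch, the type names, field names, integer types, and the helper reserve_capacity are assumptions made for this example; the description above specifies only which fields each entry holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Allocation size cache 62: one entry per distinct block size seen. */
struct size_cache_entry {
    uintptr_t addr;  /* address field 68: most recent block of this size */
    size_t    size;  /* size field 70: block size in bytes */
};

/* Reserve list 64: one entry per reserved memory block. */
struct reserve_entry {
    uintptr_t addr;          /* address field 72: start of reserved block */
    size_t    reserve_size;  /* reserve size field 74: bytes reserved */
    size_t    size;          /* size field 76: block size it is reserved for */
};

/* Free list 66: one entry per unallocated run of bytes within heap 56. */
struct free_entry {
    uintptr_t addr;  /* address field 78: first free byte */
    size_t    size;  /* size field 80: number of free bytes */
};

/* Hypothetical helper: how many blocks of the reserved-for size fit. */
size_t reserve_capacity(const struct reserve_entry *r) {
    assert(r->size > 0);
    return r->reserve_size / r->size;
}
```

For instance, a reserve entry covering 256 bytes for 16-byte blocks would have a capacity of 16 blocks.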

Heap management routine 60 is invoked in response to 
dynamic memory allocation requests from an application 
(i.e. non-operating system) program or from an operating 
system routine. Heap management routine 60 is preferably 
an operating system routine, but may be implemented as part 
of an application program, firmware, etc. 

Upon allocating a memory block for a particular dynamic 
memory allocation request, heap management routine 60 
updates allocation size cache 62 and free list 66. Free list 66 
is a list of addresses 78 within heap 56 which begin a 
memory block which is not currently allocated to an appli- 
cation program or operating system routine. Corresponding 
to each address 78 is a size 80 which indicates the number 
of bytes within the memory block which are free (i.e. up to 
the first byte which is: (i) subsequent to the corresponding 
address 78 and (ii) allocated to an application program or 
operating system routine). Free list 66 therefore defines the 
portions of heap 56 which are available to satisfy dynamic 
memory allocation requests. Upon allocation of a memory 
block, free list 66 is updated to remove the memory block 
from the available area. If the allocated memory block is 
smaller than the memory block within free list 66 which 
contains the allocated memory block, the address is 
increased (and the size decreased) to remove the allocated 
memory block from free list 66. Alternatively, a memory 
block of exactly the requested size may be within free list 66, 
in which case the entry corresponding to the memory block 
may be deleted. 
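The free list update just described may be sketched in C as follows. The array representation, the names free_entry and free_list_carve, and the restriction to allocating from the start of an entry are assumptions made for brevity; no particular representation is required by the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_entry {
    uintptr_t addr;  /* first free byte (address 78) */
    size_t    size;  /* number of free bytes (size 80) */
};

/*
 * Remove `size` bytes starting at list[i].addr from the free list.
 * If the entry is larger than the request, its address is increased and
 * its size decreased; if it matches exactly, the entry is deleted.
 * Returns the new number of entries in the list.
 */
size_t free_list_carve(struct free_entry *list, size_t count,
                       size_t i, size_t size) {
    assert(i < count && size > 0 && size <= list[i].size);
    if (list[i].size > size) {
        list[i].addr += size;
        list[i].size -= size;
        return count;
    }
    /* Exact fit: delete entry i by shifting the tail down one slot. */
    for (size_t j = i + 1; j < count; j++)
        list[j - 1] = list[j];
    return count - 1;
}
```

Allocating 0x10 bytes from an entry at address 0x1000 with size 0x100 would leave an entry at address 0x1010 with size 0xF0.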

Allocation size cache 62 records the address of the most 
recently allocated memory block for each size of memory 
block that has been allocated. The address is recorded in 
address field 68 of an entry, while the size of the memory 
block is recorded in the corresponding size field 70. Upon 
allocating a memory block, heap management routine 60 
either creates a new entry in allocation size cache 62 (if the 
requested size is not associated with a recorded entry 
already) or overwrites the address field 68 corresponding to 
the requested size recorded within a size field 70. In this 
manner, the allocation size cache 62 indicates which 
memory block was allocated for the most recent request for 
a memory block of a given size. 
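A minimal C sketch of this create-or-overwrite behavior is given below. The fixed capacity SIZE_CACHE_MAX, the linear search, and the function names are assumptions of the example; the description does not state how the cache is organized or bounded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIZE_CACHE_MAX 64  /* assumed capacity; not from the description */

struct size_cache_entry {
    uintptr_t addr;  /* address field 68 */
    size_t    size;  /* size field 70 */
};

struct size_cache {
    struct size_cache_entry e[SIZE_CACHE_MAX];
    size_t count;
};

/* Return the entry recorded for `size`, or NULL if none exists. */
struct size_cache_entry *size_cache_find(struct size_cache *c, size_t size) {
    for (size_t i = 0; i < c->count; i++)
        if (c->e[i].size == size)
            return &c->e[i];
    return NULL;
}

/* Record `addr` as the most recent allocation of `size` bytes. */
void size_cache_record(struct size_cache *c, size_t size, uintptr_t addr) {
    struct size_cache_entry *hit = size_cache_find(c, size);
    if (hit != NULL) {         /* size already seen: overwrite address */
        hit->addr = addr;
        return;
    }
    assert(c->count < SIZE_CACHE_MAX);  /* sketch: no eviction policy */
    c->e[c->count].addr = addr;
    c->e[c->count].size = size;
    c->count++;
}
```

Recording two 16-byte allocations in succession leaves a single 16-byte entry holding the address of the second.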

Heap management routine 60 attempts to allocate a 
memory block contiguous to a previously allocated memory 
block of the same size. Since dynamic data structures are 
often built using elements of a consistent size, elements of 
the dynamic data structure may be allocated in contiguous 
storage locations. As elements are added to a data structure, 
then, the elements may frequently be in contiguous memory 
locations and therefore at a fixed stride from other elements
within the data structure. 
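The contiguity test may be illustrated by the following C sketch, which computes the address immediately following the most recent block of a given size and checks whether enough free bytes begin there. The function name contiguous_candidate and the array form of the free list are assumptions of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct free_entry {
    uintptr_t addr;  /* first free byte */
    size_t    size;  /* number of free bytes */
};

/*
 * Given the address of the most recent block of `size` bytes, report
 * whether `size` free bytes start immediately after that block.
 * On success *out holds the contiguous address.
 */
bool contiguous_candidate(const struct free_entry *fl, size_t count,
                          uintptr_t prev_addr, size_t size, uintptr_t *out) {
    assert(size > 0 && out != NULL);
    uintptr_t want = prev_addr + size;
    for (size_t i = 0; i < count; i++) {
        /* The wanted address must lie inside this free run ... */
        if (fl[i].addr > want || want - fl[i].addr > fl[i].size)
            continue;
        /* ... with at least `size` free bytes remaining after it. */
        if (fl[i].size - (want - fl[i].addr) >= size) {
            *out = want;
            return true;
        }
    }
    return false;
}
```

With a previous 16-byte block at address 0x1000 and a free run beginning at 0x1010, the candidate address is 0x1010.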

If the memory block contiguous to the previously
allocated memory block is not available (determined by
examining free list 66), then heap management routine 60
attempts to reserve a memory block which is a
predetermined multiple of the requested size. The requested memory
block is allocated at the beginning of the reserved memory
block. The remainder of the reserved memory block is
reserved for other requests for a memory block of the same
size as the allocated memory block. In other words, heap
management routine 60 attempts to allocate subsequent
memory blocks having a size different than the allocated
memory block outside of the reserved memory block (unless
no other locations within the heap can satisfy the request).
Reserve list 64 is used to record the reserved memory
blocks. The reserved memory block is not removed from
free list 66 until actually allocated to subsequent requests. In
this manner, the reserved memory block is available if
unreserved portions of free list 66 are completely allocated
and another dynamic memory allocation request is received.
Additionally, if an entry in allocation size cache 62 is found
for a given dynamic memory request, the subsequent
locations may be more likely to be allocatable since the
subsequent locations are reserved. Each entry in reserve list 64
indicates the address at the beginning of the reserve block
(address field 72), the size of the reserved block (reserve size
field 74), and the size of the memory blocks for which the
reserved memory block is reserved (size field 76). Upon
allocating a memory block for which no matching entry is
found in allocation size cache 62, heap management routine
60 allocates a memory block from free list 66 (preferably
outside of the reserved memory blocks) and reserves a
reserved memory block for requests of the size of the
memory block (if possible). The reserved memory block
includes the allocated memory block, and is recorded within
reserve list 64.
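One way the routine might steer requests of other sizes away from a reserved memory block is sketched below in C. The function name conflicts_with_reservation and the array form of the reserve list are assumptions of the example; a caller would treat a conflicting range as a last resort, used only when no unreserved memory can satisfy the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct reserve_entry {
    uintptr_t addr;          /* address field 72 */
    size_t    reserve_size;  /* reserve size field 74 */
    size_t    size;          /* size field 76 */
};

/*
 * True if the range [addr, addr + len) overlaps a memory block that is
 * reserved for a block size other than `size`.
 */
bool conflicts_with_reservation(const struct reserve_entry *rl, size_t count,
                                uintptr_t addr, size_t len, size_t size) {
    assert(len > 0);
    for (size_t i = 0; i < count; i++) {
        uintptr_t lo = rl[i].addr;
        uintptr_t hi = rl[i].addr + rl[i].reserve_size;
        bool overlaps = addr < hi && lo < addr + len;
        if (overlaps && rl[i].size != size)
            return true;
    }
    return false;
}
```

A 16-byte request falling inside a block reserved for 16-byte requests does not conflict, whereas a 32-byte request overlapping the same block does.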

The size of the reserve memory block (i.e. the
predetermined multiple of the request size) may be related to the
expected number of dynamic allocation requests to be
received for that sized memory block. For example, a
multiple within the range of 20-100 may be suitable. The
multiple may depend upon the requested size. In particular,
it may be advantageous to reserve a larger number of small
memory blocks while reserving a smaller number of large
memory blocks. Reserving a large number of large memory
blocks may quickly occupy a large amount of the heap,
while a large number of smaller memory blocks may be less
susceptible to this problem.
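A size-dependent multiple might be chosen as in the following C sketch. The thresholds of 64 and 1024 bytes and the intermediate multiple of 50 are invented purely for illustration; only the 20-100 range and the tendency to reserve more small blocks than large blocks come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical policy: the byte thresholds below are assumptions.
 * Small requests get the largest multiple, large requests the smallest.
 */
size_t reserve_multiple(size_t request_size) {
    assert(request_size > 0);
    if (request_size <= 64)
        return 100;
    if (request_size <= 1024)
        return 50;
    return 20;
}

/* Size in bytes of the memory block to reserve for `request_size`. */
size_t reserve_bytes(size_t request_size) {
    return reserve_multiple(request_size) * request_size;
}
```

Under these assumed thresholds, a 16-byte request would reserve 1600 bytes, while a 4096-byte request would reserve 81920 bytes.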

It is noted that, instead of employing reserve list 64, heap
management routine 60 may allocate the reserved memory
block from free list 66. Heap management routine 60 would
then maintain a list of allocated reserve memory blocks and
allocate memory from within the allocated reserve memory
blocks for subsequent dynamic memory allocation requests
of the size corresponding to the reserve memory blocks.
Other dynamic memory allocation requests may be satisfied
with memory allocated from free list 66.
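This alternative resembles a per-size pool from which blocks are handed out in order. The C sketch below is one possible reading; the names reserve_pool and pool_alloc are assumptions, and the sketch does not reuse blocks that have been freed back into the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One allocated reserve memory block, handed out in fixed-size pieces. */
struct reserve_pool {
    uintptr_t base;        /* start of the allocated reserve block */
    size_t    bytes;       /* total size of the reserve block */
    size_t    block_size;  /* request size this pool serves */
    size_t    used;        /* bytes already handed out */
};

/*
 * Return the next contiguous block from the pool, or 0 when the request
 * size does not match or the pool is exhausted. In either failure case
 * the caller would fall back to the free list.
 */
uintptr_t pool_alloc(struct reserve_pool *p, size_t size) {
    assert(p->block_size > 0 && p->used <= p->bytes);
    if (size != p->block_size || p->bytes - p->used < size)
        return 0;
    uintptr_t addr = p->base + p->used;
    p->used += size;
    return addr;
}
```

Successive requests of the pool's size receive addresses separated by exactly that size until the pool is exhausted.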

Turning now to FIG. 5, a flowchart illustrating the
operation of one embodiment of heap management routine 60 in
response to a dynamic memory allocation request is shown.
Heap management routine 60 compares the requested size to
the sizes recorded in allocation size cache 62 (decision block
90). If the requested size equals one of the sizes recorded in
allocation size cache 62, the heap management routine 60
attempts to allocate the memory block contiguous to the
previously allocated memory block of that size (e.g. by
adding the address of the previously allocated memory block
from address field 68 to the requested size). Heap
management routine 60 checks free list 66 to determine the
availability of the succeeding memory locations (i.e. the
contiguous memory block) (decision block 92). If the contiguous
memory block is available, then heap management routine
60 allocates the contiguous memory block (step 94). Heap
management routine 60 updates free list 66 to remove the
allocated memory block therefrom, and updates allocation
size cache 62 to reflect the address of the allocated memory
block (i.e. overwrites address field 68 of the corresponding
entry with the address of the allocated memory block).

If, on the other hand, the contiguous memory block is not
available, heap management routine 60 attempts to allocate
a reserve memory block for memory blocks of the requested
size (decision block 96). Heap management routine 60
searches free list 66 for a memory block having a size equal
to or greater than the predetermined multiple of the request
size. If a reserve block is located, the requested memory
block is allocated at the beginning of the reserve memory
block, and free list 66 is updated according to the requested
size (step 98). Additionally, reserve list 64 is updated to
indicate the size of the reserved memory block. The
remainder of the reserved memory block (i.e. not including the
allocation of the requested memory block) is not removed
from free list 66 to facilitate usage of these memory
locations by subsequent blocks of the requested size (which may
allocate the contiguous memory block without regard to
reserve list 64), and to facilitate usage of these memory
locations for any memory request if the remainder of the
heap becomes allocated. Still further, the corresponding
entry within allocation size cache 62 is updated with the
address of the newly allocated memory block (overwriting
the address of the previously allocated memory block within
address field 68).

If a memory block suitable for the reserve memory block
is not located within free list 66, heap management routine
60 allocates the requested memory block, updates free list 66
to indicate the allocation, and updates allocation size cache
62 with the address of the allocated memory block
(overwriting the previous address corresponding to the
requested size) (step 100).

Returning to decision block 90, if a requested size is not
found within allocation size cache 62, heap management
routine 60 attempts to allocate the requested memory block
outside of any reserve spaces listed in reserve list 64, if
possible (step 102). If it is not possible to allocate the
requested size outside of the reserved memory blocks, heap
management routine 60 allocates a memory block within
one of the reserved memory blocks. Heap management
routine 60 may then delete the affected reserved memory
block from reserve list 64. Additionally, heap management
routine 60 may resort to a "first fit" or "best fit" approach to
memory allocation in step 102 if allocation outside of the
reserved memory blocks is not possible.

Upon allocation, free list 66 and allocation size cache 62
are updated with respect to the allocated memory block. If
desired, a reserve memory block corresponding to the
requested size may be formed at step 102 as well, and
reserve list 64 may correspondingly be updated.

It is noted that, although the steps shown in the flowchart
of FIG. 5 and other flowcharts herein are shown sequentially
for allowing understanding, the flowcharts may be
implemented using any set of steps which accomplishes the same
operation.

Turning next to FIG. 6, a flowchart illustrating operation
of one embodiment of heap management routine 60 upon
receiving a dynamic memory deallocation request is shown.
Dynamic memory deallocation is often referred to as
"freeing" memory. For example, in the "C" programming
language, dynamic memory allocation may be accomplished
using a "malloc" function call while dynamic memory
deallocation may be accomplished using a "free" function
call.

Heap management routine 60 determines if the
deallocated memory block is within a reserved memory block by
examining reserve list 64 (decision block 110). If the
deallocated memory block is not within a reserved block, heap
management routine 60 updates free list 66 to reflect the
freed memory (step 112). On the other hand, if the
deallocated memory block is within a reserved block, heap
management routine 60 determines if the entire reserved
memory block has been deallocated (decision block 114). If
the entire reserved memory block has been deallocated, the
entry corresponding to the reserved memory block is deleted
from reserve list 64 (step 116). In either case, free list 66 is
updated to reflect the freed memory.

Turning next to FIG. 7, a first example of a dynamically
allocated data structure performed using one embodiment of
heap management routine 60 is shown. The dynamically
allocated data structure shown in FIG. 7 is a linked list. A
linked list is a data structure in which each element in the list
points to at least one other element in the list. Typically, the
list elements are equal in size. A head pointer is used to
identify the first element in the list. The first element in the
list (in addition to storing an item in the list) points to the
second element, which in turn points to the third element,
etc. Elements may be added to the head of the list, the tail
of the list, or within the list. For the remainder of this
example, addresses will be expressed in hexadecimal
format.

At reference numeral 120, the exemplary linked list is
illustrated at a first point in time in which the list has one
element 122 (allocated at an address 1000). The head pointer
of the list points to element 122 (i.e. the head pointer has a
value of 1000). During the dynamic memory allocation
request for element 122, heap management routine 60
recorded the address 1000 and the size of element 122 in
allocation size cache 62.

At reference numeral 124, the exemplary linked list is
shown after several elements 126, 128, and 130 have been
added. As each element 126-130 is added, heap
management routine 60 attempts to allocate memory contiguous to
the previous allocation. In the present example, elements
122 and 126-130 each include 16 bytes (10 in hexadecimal
notation). Therefore, heap management routine 60 allocates
memory blocks beginning at addresses 1010, 1020, and
1030. Additionally, heap management routine 60 in the
present example reserves a memory block of 256 bytes (16
times the size of the elements ... 16 bytes each). As
illustrated at reference numeral 132, elements are
successfully allocated at contiguous memory locations through
element 134 at address 10F0. The next allocation of an
element 136 occurs at an address 2000 in the example.
Another memory block is reserved, thereby allowing
element 138 to be allocated contiguous to element 136. It is
noted that any multiple of the element size may be selected
and that the size selected for this example is for exemplary
purposes only.

Traversing the linked list shown in FIG. 7 comprises
multiple accesses at a fixed stride from each other (e.g.
elements 122 and 126-130 shown at reference numeral 124
and elements 122, 126-130, and 134 shown at reference 
numeral 132). Traversing the linked list may therefore be 
successfully prefetched using a stride-based prediction 
method such as that employed by stride prefetch unit 26. At 
element 136, the prefetch may fail but may subsequently 
resume correct predictions beginning with element 138 or a 
subsequent element. The prefetch accuracy in general may 
be substantially higher than that achievable with a heap 
management algorithm which does not attempt to allocate 
like-sized memory blocks contiguously. 
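The addresses of the FIG. 7 example can be checked with a short C sketch. The helper names element_addr and fixed_stride are assumptions of the example; the base address 1000, the element size of 10 (hexadecimal), and the jump to address 2000 are taken from the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Address of the k-th element (k = 0 for the first element) within a
   reserved memory block that begins at `base`. */
uintptr_t element_addr(uintptr_t base, size_t elem_size, size_t k) {
    return base + k * elem_size;
}

/* True if every consecutive pair of addresses differs by `stride`. */
bool fixed_stride(const uintptr_t *addrs, size_t n, uintptr_t stride) {
    assert(n >= 2);
    for (size_t i = 1; i < n; i++)
        if (addrs[i] - addrs[i - 1] != stride)
            return false;
    return true;
}
```

Sixteen 16-byte elements beginning at 0x1000 occupy addresses 0x1000 through 0x10F0 at a constant stride of 0x10. Appending an element at 0x2000 breaks the stride once, which corresponds to the single prefetch failure expected at element 136.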

Turning now to FIG. 8, a second example of a dynami- 
cally allocated data structure performed using one embodi- 
ment of heap management routine 60 is shown. The dynami- 
cally allocated data structure shown in FIG. 8 is again a 
linked list. For the remainder of this example, addresses will 
be expressed in hexadecimal format. 

At reference numeral 140, the exemplary linked list is 
shown having elements 142, 144, 146, 148, 150, 152, and 
154. Each element is separated from the subsequent element 
by an equal stride amount. Therefore, stride based prefetch- 
ing may successfully fetch each of the items in the list. 

At reference numeral 156, the exemplary linked list is 
shown after deallocating element 148 from the list. 
Unfortunately, the stride between elements 146 and 150 is 
no longer equal to the stride between the other elements. 
Additionally, at reference numeral 158, the exemplary list 
shown at reference numeral 156 is shown with a new 
element 160 inserted into the interior of the list. Again, the 
fixed stride distance between subsequent elements within the 
list is interrupted. Traversing the lists shown at reference 
numerals 156 and 158 may lead to several mispredicted 
prefetches. Additional insertions and deletions within the list 
may lead to additional discontinuities. 

Fortunately, an application program may take advantage 
of the properties of heap management routine 60 to correct 
the discontinuities in the list. The application program may 
simply rebuild the list (i.e. allocate elements beginning at
the head of the list and copy the contents of the current 
elements of the list into the new list). In this manner, the 
properties of heap management routine 60 may result in a 
list which again exhibits fixed strides between the elements. 
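Such a rebuild may be sketched in C as follows. The node layout with an int payload and the function name rebuild_list are assumptions of the example. The sketch calls the standard malloc and free functions, which by themselves make no promise of contiguous placement; the fixed stride would result only when the underlying allocator behaves like heap management routine 60.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    int          item;  /* payload; the element type is an assumption */
    struct node *next;
};

/*
 * Replace *headp with a fresh copy whose elements are allocated in list
 * order, then free the old elements. Returns 0 on success, or -1 if an
 * allocation fails, in which case the original list is left unchanged.
 */
int rebuild_list(struct node **headp) {
    assert(headp != NULL);
    struct node *new_head = NULL, **tail = &new_head;
    for (const struct node *p = *headp; p != NULL; p = p->next) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL) {               /* undo the partial copy */
            while (new_head != NULL) {
                struct node *t = new_head->next;
                free(new_head);
                new_head = t;
            }
            return -1;
        }
        n->item = p->item;
        n->next = NULL;
        *tail = n;                     /* append to the new list */
        tail = &n->next;
    }
    for (struct node *p = *headp; p != NULL; ) {
        struct node *t = p->next;      /* release the old elements */
        free(p);
        p = t;
    }
    *headp = new_head;
    return 0;
}
```

Because the new elements are requested one after another and all have the same size, each request after the first is a candidate for placement contiguous to its predecessor.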

Turning now to FIG. 9, a computer system 200 including 
microprocessor 10 is shown. Computer system 200 further 
includes a bus bridge 202, a main memory 204, and a 
plurality of input/output (I/O) devices 206A-206N. Plurality 
of I/O devices 206A-206N will be collectively referred to as
I/O devices 206. Microprocessor 10, bus bridge 202, and 
main memory 204 are coupled to a system bus 208. I/O 
devices 206 are coupled to an I/O bus 210 for communica- 
tion with bus bridge 202. Additionally shown in FIG. 9 is a 
computer storage medium 212 coupled to I/O bus 210. 
Alternatively, computer storage medium 212 may be 
coupled to system bus 208. 

Generally, a computer storage medium is a storage 
medium upon which computer code and/or data may be 
stored. The code and/or data may be stored in a non-volatile 
fashion, such as upon a hard disk drive, a compact disk-read
only memory (CD-ROM), flash memory, or other non-volatile
storage. Alternatively, the storage may be volatile
such as a dynamic random access memory (DRAM) or static 
RAM (SRAM) storage. 

Main memory 204 may be an example of a volatile 
storage. In one embodiment, computer storage medium 212 
is configured to store at least heap management routine 60. 
Heap management routine 60 may be loaded into main 
memory 204 and executed, and allocation size cache 62,
reserve list 64, and free list 66 may be maintained in main
memory 204 as well.

Bus bridge 202 is provided to assist in communications
between I/O devices 206 and devices coupled to system bus
208. I/O devices 206 typically require longer bus clock
cycles than microprocessor 10 and other devices coupled to
system bus 208. Therefore, bus bridge 202 provides a buffer
between system bus 208 and input/output bus 210.
Additionally, bus bridge 202 translates transactions from
one bus protocol to another. In one embodiment,
input/output bus 210 is an Enhanced Industry Standard
Architecture (EISA) bus and bus bridge 202 translates from the
system bus protocol to the EISA bus protocol. In another
embodiment, input/output bus 210 is a Peripheral
Component Interconnect (PCI) bus and bus bridge 202 translates
from the system bus protocol to the PCI bus protocol. It is
noted that many variations of system bus protocols exist.
Microprocessor 10 may employ any suitable system bus
protocol.

I/O devices 206 provide an interface between computer
system 200 and other devices external to the computer
system. Exemplary I/O devices include a modem, a serial or
parallel port, a sound card, etc. I/O devices 206 may also be
referred to as peripheral devices. Main memory 204 stores
data and instructions for use by microprocessor 10. In one
embodiment, main memory 204 includes at least one
Dynamic Random Access Memory (DRAM) and a DRAM
memory controller.

It is noted that although computer system 200 as shown in
FIG. 9 includes one bus bridge 202, other embodiments of
computer system 200 may include multiple bus bridges 202
for translating to multiple dissimilar or similar I/O bus
protocols. Still further, a cache memory for enhancing the
performance of computer system 200 by storing instructions
and data referenced by microprocessor 10 in a faster
memory storage may be included. The cache memory may
be inserted between microprocessor 10 and system bus 208,
or may reside on system bus 208 in a "lookaside"
configuration. It is still further noted that the functions of bus bridge
202, main memory 204, and the cache memory may be
integrated into a chipset which interfaces to microprocessor
10. It is still further noted that the present discussion may
refer to the assertion of various signals. As used herein, a
signal is "asserted" if it conveys a value indicative of a
particular condition. Conversely, a signal is "deasserted" if
it conveys a value indicative of a lack of a particular
condition. A signal may be defined to be asserted when it
conveys a logical zero value or, conversely, when it conveys
a logical one value.

In accordance with the above disclosure, a computer
system has been shown which includes a dynamic memory
allocation routine which attempts to allocate memory in a
manner optimized for prefetching. The dynamic memory
allocation routine attempts to allocate memory blocks of
equal size in contiguous memory locations, thereby allowing
a stride-based prefetch algorithm to achieve success when
traversing a dynamically allocated data structure built using
like-sized elements. Advantageously, performance may be
increased through the successful prefetch of data within
dynamic data structures.

Numerous variations and modifications will become
apparent to those skilled in the art once the above disclosure
is fully appreciated. It is intended that the following claims
be interpreted to embrace all such variations and
modifications.
What is claimed is:

1. A method for dynamic memory allocation in a com- 
puter system, comprising: 

receiving a first request for dynamic allocation of a first
memory block including a first number of bytes;

allocating said first memory block at a first address
succeeding a second address corresponding to a last
byte of a previously allocated memory block having
said first number of bytes;

allocating said first memory block at a third address if said
previously allocated memory block has a second
number of bytes not equal to said first number of bytes; and

reserving a second memory block beginning at said third
address and including a third number of bytes equal to
a predetermined multiple of said first number of bytes.

2. The method as recited in claim 1 wherein said reserving 
comprises storing said third address, said first number of 
bytes, and said third number of bytes in a reserve list. 

3. The method as recited in claim 2 further comprising
receiving a second request for a dynamic allocation of a third
memory block.

4. The method as recited in claim 3 further comprising
allocating said third memory block at a fourth address
succeeding a fifth address corresponding to a last byte of
said first block if said third memory block has said first
number of bytes.

5. The method as recited in claim 3 further comprising
allocating said third memory block at a sixth address outside
of said second memory block if said third memory block has
a fourth number of bytes not equal to said first number of
bytes.

6. The method as recited in claim 2 further comprising
deallocating said first memory block upon receiving a
deallocation request for said first memory block.

7. The method as recited in claim 6 further comprising
releasing a reservation for said second memory block upon
said deallocating said first memory block.

8. The method as recited in claim 7 further comprising
deleting said third address, said first number of bytes, and
said third number of bytes from said reserve list upon
deallocating said first memory block.

9. The method as recited in claim 1 further comprising
maintaining an allocation size cache indicating which
addresses were previously allocated for memory blocks
having different numbers of bytes.

10. The method as recited in claim 1 further comprising
comparing said first number of bytes to said different
numbers of bytes to determine if said previously allocated
memory block has said first number of bytes.

11. A computer storage medium configured to store a
dynamic memory management routine which, in response to
a first request for a dynamic allocation of a first memory
block having a first number of bytes:

allocates said first memory block at a first address
contiguous to a second memory block having said first
number of bytes;

allocates said first memory block at a second address
discontiguous to said second memory block if said 
second memory block has a second number of bytes not 
equal to said first number of bytes; and 

reserves a third memory block beginning at said second 
address and having a third number of bytes equal to a 
predetermined multiple of said first number of bytes. 

12. The computer storage medium as recited in claim 11, 
wherein said dynamic memory management routine 
reserves said third memory block by storing said second 
address and said third number of bytes in a reserve list. 

13. The computer storage medium as recited in claim 12, 
wherein said dynamic memory management routine further 
stores said first number of bytes in said reserve list. 

14. The computer storage medium as recited in claim 11, 
wherein said dynamic memory management routine records 
a third address corresponding to said second memory block 
in an allocation size cache. 

15. The computer storage medium as recited in claim 14, 
wherein said dynamic memory management routine records 
said first address in said allocation size cache. 

16. A computer storage medium configured to store a 
dynamic memory management routine which, in response to 
a first request for a dynamic allocation of a first memory 
block having a first number of bytes: 

allocates said first memory block at a first address con- 
tiguous to a second memory block having said first 
number of bytes; 

allocates said first memory block at a second address 
discontiguous to said second memory block if said 
second memory block has a second number of bytes not 
equal to said first number of bytes; 

records a third address corresponding to said second 
memory block in an allocation size cache; 

records said first address in said allocation size cache; and 

overwrites said third address in said allocation size cache 
if said second memory block has said first number of 
bytes. 

17. The computer storage medium as recited in claim 15, 
wherein said dynamic memory management routine stores 
said first address in addition to said third address within said 
allocation size cache if said second memory block has said 
second number of bytes. 

18. The computer storage medium as recited in claim 15, 
wherein said dynamic memory management routine further 
records a number of bytes corresponding to each address in 
said allocation size cache, whereby said dynamic memory 
management routine determines if said first memory block 
and said second memory block comprise equal numbers of 
bytes. 

***** 